Add prepare command #38
Conversation
Note to self:
tests/test_memmap_dataset.py
```python
from fast_llm.data.gpt.memmap import GPTMemmapDataset
import pytest


def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
```
I'm not following what the hypothesis module brings here. You seem to be just creating a list of random arrays, is that right? This can easily be done in plain numpy with the same function complexity.
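For reference, a plain-numpy version of such a generator might look like the following (an illustrative sketch, not code from this PR; the function name and element ranges are hypothetical):

```python
import numpy as np

# Hypothetical plain-numpy equivalent (not from this PR): build a list of
# random 1-D arrays of the given dtype up front, with a fixed seed.
def random_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100, seed: int = 0) -> list[np.ndarray]:
    rng = np.random.default_rng(seed)
    num_arrays = int(rng.integers(min_size, max_size + 1))
    return [
        rng.integers(0, 1000, size=int(rng.integers(1, 100))).astype(dtype)
        for _ in range(num_arrays)
    ]
```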
The benefit is that hypothesis will try to shrink the inputs to a minimal reproducing example in case of a problem.
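As an illustration of that point, a strategy plus `@given` test could look like this (a sketch assuming `hypothesis` with its numpy extra; the PR's actual strategy and test may differ):

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


# Illustrative strategy: lists of random 1-D arrays of the given dtype.
def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
    return st.lists(
        arrays(dtype=dtype, shape=st.integers(min_value=1, max_value=1000)),
        min_size=min_size,
        max_size=max_size,
    )


@given(documents=dtype_arrays(np.dtype(np.int32)))
def test_dtype_preserved(documents: list[np.ndarray]):
    # If this assertion fails, hypothesis re-runs the test on progressively
    # smaller inputs and reports a minimal failing list of arrays.
    for document in documents:
        assert document.dtype == np.int32
```

With plain-numpy random inputs, a failure reports whatever large random batch happened to trigger it; automatic shrinking to a small counterexample is what the strategy-based formulation buys.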
LGTM, assuming my proposed modifications are ok
✨ Description
Extracted and refined the dataset preparation script from #17.
Made it a command like `train` or `convert`. Example call and config:

or

where `foo.yaml` contains:

Run `git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M` in `tmp` to get that tokenizer file.

This will produce:

with `fast_llm_dataset.json` reading:

```json
{
  "datasets": [
    {
      "prefix": "shard_0_0",
      "num_documents": 10000,
      "num_tokens": 11569536,
      "weight": 1.0
    }
  ]
}
```

The `downloaded_dataset` can be deleted afterwards. It is not used by Fast-LLM.
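For illustration, a small standalone snippet (hypothetical, not part of this PR) that reads the generated index and summarizes it, assuming only the JSON layout shown above:

```python
import json
from pathlib import Path

# Hypothetical helper: summarize the index written by the prepare command.
def summarize_index(path: Path) -> None:
    index = json.loads(path.read_text())
    for dataset in index["datasets"]:
        print(f"{dataset['prefix']}: {dataset['num_documents']} documents, "
              f"{dataset['num_tokens']} tokens, weight {dataset['weight']}")
    print("total tokens:", sum(d["num_tokens"] for d in index["datasets"]))

summarize_index(Path("fast_llm_dataset.json"))
```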
🔍 Type of change
Select all that apply:
📝 Changes
- `prepare_dataset` command
- Dockerfile

✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General:
Dependencies and Configuration:
Testing:
Performance Impact:
📊 Performance Impact Details
N/A
📝 Additional Notes
N/A